{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 交叉验证\n",
    "\n",
    "     CrossValidator将数据集划分为若干子集分别地进行训练和测试\n",
    " \n",
    "     注意对参数网格进行交叉验证的成本是很高的。如下面例子中，参数网格hashingTF.numFeatures有3个值，lr.regParam有2个值，CrossValidator使用2折交叉验证。这样就会产生(3*2)*2=12中不同的模型需要进行训练。在实际的设置中，通常有更多的参数需要设置，且我们可能会使用更多的交叉验证折数（3折或者10折都是经使用的）。所以CrossValidator的成本是很高的，尽管如此，比起启发式的手工验证，交叉验证仍然是目前存在的参数选择方法中非常有用的一种。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml import Pipeline\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "from pyspark.ml.feature import HashingTF, Tokenizer\n",
    "from pyspark.ml.tuning import CrossValidator, ParamGridBuilder\n",
    " \n",
    "# Prepare training documents, which are labeled.\n",
    "training = spark.createDataFrame([\n",
    "    (0, \"a b c d e spark\", 1.0),\n",
    "    (1, \"b d\", 0.0),\n",
    "    (2, \"spark f g h\", 1.0),\n",
    "    (3, \"hadoop mapreduce\", 0.0),\n",
    "    (4, \"b spark who\", 1.0),\n",
    "    (5, \"g d a y\", 0.0),\n",
    "    (6, \"spark fly\", 1.0),\n",
    "    (7, \"was mapreduce\", 0.0),\n",
    "    (8, \"e spark program\", 1.0),\n",
    "    (9, \"a e c l\", 0.0),\n",
    "    (10, \"spark compile\", 1.0),\n",
    "    (11, \"hadoop software\", 0.0)\n",
    "], [\"id\", \"text\", \"label\"])\n",
    " \n",
    "# Configure an ML pipeline, which consists of tree stages: tokenizer, hashingTF, and lr.\n",
    "tokenizer = Tokenizer(inputCol=\"text\", outputCol=\"words\")\n",
    "hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol=\"features\")\n",
    "lr = LogisticRegression(maxIter=10)\n",
    "pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])\n",
    " \n",
    "# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.\n",
    "# This will allow us to jointly choose parameters for all Pipeline stages.\n",
    "# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.\n",
    "# We use a ParamGridBuilder to construct a grid of parameters to search over.\n",
    "# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,\n",
    "# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.\n",
    "paramGrid = ParamGridBuilder() \\\n",
    "    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \\\n",
    "    .addGrid(lr.regParam, [0.1, 0.01]) \\\n",
    "    .build()\n",
    " \n",
    "crossval = CrossValidator(estimator=pipeline,\n",
    "                          estimatorParamMaps=paramGrid,\n",
    "                          evaluator=BinaryClassificationEvaluator(),\n",
    "                          numFolds=2)  # use 3+ folds in practice\n",
    " \n",
    "# Run cross-validation, and choose the best set of parameters.\n",
    "cvModel = crossval.fit(training)\n",
    " \n",
    "# Prepare test documents, which are unlabeled.\n",
    "test = spark.createDataFrame([\n",
    "    (4, \"spark i j k\"),\n",
    "    (5, \"l m n\"),\n",
    "    (6, \"mapreduce spark\"),\n",
    "    (7, \"apache hadoop\")\n",
    "], [\"id\", \"text\"])\n",
    " \n",
    "# Make predictions on test documents. cvModel uses the best model found (lrModel).\n",
    "prediction = cvModel.transform(test)\n",
    "selected = prediction.select(\"id\", \"text\", \"probability\", \"prediction\")\n",
    "for row in selected.collect():\n",
    "    print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模型选择和超参数调整\n",
    "在机器学习中非常重要的任务就是模型选择，或者使用数据来找到具体问题的最佳的模型和参数，这个过程也叫做调试。调试可以在独立的如逻辑回归等估计器中完成，也可以在包含多样算法、特征工程和其他步骤的管线中完成。用户应该一次性调试整个管线，而不是独立的调整管线中的每个组成部分。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练验证分裂\n",
    "       除了交叉验证以外，Spark还提供训练验证分裂用以超参数调整。和交叉验证评估K次不同，训练验证分裂只对每组参数评估一次。因此它计算代价更低，但当训练数据集不是足够大时，其结果可靠性不高。\n",
    "\n",
    "      与交叉验证不同，训练验证分裂仅需要一个训练数据与验证数据对。使用训练比率参数将原始数据划分为两个部分。如当训练比率为0.75时，训练验证分裂使用75%数据以训练，25%数据以验证。\n",
    "\n",
    "     与交叉验证相同，确定最佳参数表后，训练验证分裂最后使用最佳参数表基于全部数据来重新拟合估计器。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.evaluation import RegressionEvaluator\n",
    "from pyspark.ml.regression import LinearRegression\n",
    "from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit\n",
    " \n",
    "# Prepare training and test data.\n",
    "data = spark.read.format(\"libsvm\")\\\n",
    "    .load(\"data/mllib/sample_linear_regression_data.txt\")\n",
    "train, test = data.randomSplit([0.7, 0.3])\n",
    "lr = LinearRegression(maxIter=10, regParam=0.1)\n",
    " \n",
    "# We use a ParamGridBuilder to construct a grid of parameters to search over.\n",
    "# TrainValidationSplit will try all combinations of values and determine best model using\n",
    "# the evaluator.\n",
    "paramGrid = ParamGridBuilder()\\\n",
    "    .addGrid(lr.regParam, [0.1, 0.01]) \\\n",
    "    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\\\n",
    "    .build()\n",
    " \n",
    "# In this case the estimator is simply the linear regression.\n",
    "# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.\n",
    "tvs = TrainValidationSplit(estimator=lr,\n",
    "                           estimatorParamMaps=paramGrid,\n",
    "                           evaluator=RegressionEvaluator(),\n",
    "                           # 80% of the data will be used for training, 20% for validation.\n",
    "                           trainRatio=0.8)\n",
    " \n",
    "# Run TrainValidationSplit, and choose the best set of parameters.\n",
    "model = tvs.fit(train)\n",
    "# Make predictions on test data. model is the model with combination of parameters\n",
    "# that performed best.\n",
    "prediction = model.transform(test)\n",
    "for row in prediction.take(5):\n",
    "    print(row)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
